Matteo Ceccarello
University of Padova
matteo.ceccarello@unipd.it
Joint work with Johann Gamper (U. Bolzano)
Fix a subsequence length \(w\)
This timeseries is from an experiment recording the behavior of a Circulifer tenellus insect. In particular, the subsequences below are associated with a particular behavior while feeding.
Definition 1 Given a time series \(T\) and a length \(w\), the top-\(k\) motifs are the \(k\) pairs of subsequences of \(T\) of length \(w\) with smallest distance, such that no two subsequences in any pair overlap with each other.
Attimo, a randomized algorithm for discovering the exact top-\(k\) motifsAttimo auto tunes its parameters to the data distributionAttimo allows to control the success probability, i.e. its recallAttimo can handle time series of one billion values in just half an hourAlgorithm
Pros
Cons
Subsequences of length \(w\) are points in \(R^w\),
with Euclidean distance.
\[ h(\vec{x}) = \left\lfloor\frac{\vec{a} \cdot \vec{x} + b}{r}\right\rfloor \]
The key point is that we only compute the distance of subsequences falling
into the same bucket.
To lower the collision probability we concatenate \(\tau\) hash functions \[ \hat{h}(\vec{x}) = \langle h_1(\vec{x}), \dots, h_\tau(\vec{x}) \rangle \] this makes for a better precision of the index.
And to increase the recall of the index we repeat \(L\) times.
In each iteration we compute the distance of all subsequences in the same bucket.
In the first iteration, with \(k=4\), we discover a pair at distance 2
.
After 10 repetitions, we find a pair at distance 1
.
After 100 repetitions we did not find a better pair,
and the success probability is about 0.85
We then consider shorter prefixes of the hashes,
going through the 100 repetitions again.
After 15 repetitions, we observe that the
success probability is above our target, and thus return.
Theorem 1 Our algorithm finds the exact top-\(k\) motifs each with probability \(\ge 1-\delta\).
Theorem 2 The LSH index construction takes time \[ O(\tau_{max} \cdot \sqrt{L_{max}}\; n\log n) \]
Theorem 3 Let \(d(m_k)\) be the distance of the \(k\)-th motif, and \(i'\le \tau_{max}\), \(j' \le L_{max}\) be the parameters used by the optimal LSH algorithm. Then, the algorithm considers \[ O\left( j'\sum_{m\in T^w\times T^w} p(d(m))^{i'} + (L_{max}-j')\sum_{m\in T^w\times T^w} p(d(m))^{i'+1} \right) \] pairs in expectation.
Find the top-10 motifs.
| dataset | \(n\) (millions) | RC |
|---|---|---|
| astro | 1 | 8.63 |
| GAP | 2 | 9.17 |
| freezer | 7 | 7.95 |
| ECG | 7 | 109.06 |
| HumanY | 26 | 581.03 |
| Whales | 308 | 21.66 |
| Seismic | 1000 | 274.44 |
Relative Contrast measures difficulty: higher is easier. \[ RC = \frac{d_{avg}}{d_{motif}} \]
Synthetic data, planted motif.
For shorter time series, or if the relative contrast is very small, use the Matrix Profile.
For time series of a few million values and above, with a not-to-small relative contrast, try Attimo
Attimo gives you control over the recall, and adapts to the data distribution
https://github.com/Cecca/attimo
matteo.ceccarello@unipd.it
Running for top-10 motifs, for different number of repetitions.
Data from LIGO:
Attimo: 6 hoursSCAMP: \(\approx 1\) secondfreezer and the 7-th motif7M points, window 5000